FicTree: a Manually Annotated Treebank of Czech Fiction

Author

  • Tomáš Jelínek
Abstract

We present a manually annotated treebank of Czech fiction, intended as an addendum to the Prague Dependency Treebank (PDT). With only 166,000 tokens, the treebank is too small to serve as a good basis for training NLP tools on its own, but added to the PDT training data, it can help improve the annotation of fiction texts. We describe the composition of the corpus and the annotation process, including inter-annotator agreement. On the newly created data and the PDT data, we performed a number of experiments with parsers (TurboParser, Parsito, MSTParser and MaltParser). We observe that extending the PDT training data with a part of the new treebank does improve the parsing of literary texts. We also investigate cases where the parsers agree on an annotation that differs from the manual one.
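The two evaluation ideas mentioned in the abstract, scoring parser output against the manual annotation and finding tokens where the parsers agree with each other but not with the manual annotation, can be illustrated with a small sketch. This is not the paper's evaluation code: the data format (one `(head, label)` pair per token), the function names, and the toy sentence are all assumptions made for illustration.

```python
from typing import List, Tuple

# One token's annotation: (head index, dependency label).
# Hypothetical toy format, not the treebank's actual file format.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (unlabeled, labeled) attachment score of pred against gold."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and cover the same tokens")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n  # head matches
    las = sum(g == p for g, p in zip(gold, pred)) / n        # head and label match
    return uas, las


def unanimous_disagreements(gold: List[Arc], parses: List[List[Arc]]) -> List[int]:
    """Token positions where all parsers output the same arc and it differs from gold."""
    positions = []
    for i, gold_arc in enumerate(gold):
        arcs = {parse[i] for parse in parses}
        if len(arcs) == 1 and gold_arc not in arcs:
            positions.append(i)
    return positions


# Toy four-token sentence (invented data).
gold = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (3, "Atr")]
parser_a = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (3, "Atr")]
parser_b = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (2, "Atr")]

print(attachment_scores(gold, parser_a))
print(unanimous_disagreements(gold, [parser_a, parser_b]))
```

Positions returned by `unanimous_disagreements` are candidates for manual review: they may point to systematic parser errors or to inconsistencies in the manual annotation.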


Similar resources

Studying Properties of Czech Complex Sentences from an Annotated Corpus

The paper deals with the analysis of complex sentences in Czech on the basis of manually annotated data. The availability of a specialized corpus explicitly describing mutual relationships between segments and clauses in Czech complex sentences, together with the availability of a thoroughly syntactically annotated corpus, the Prague Dependency Treebank, provides a solid background...


Two Tectogrammatical Realizers Side by Side: Case of English and Czech

We present work in progress on a pair of morphosyntactic realizers sharing the same architecture. We provide a description of the input tree structures, describe our procedural approach for two typologically different languages, and finally present preliminary evaluation results obtained on a manually annotated treebank.


Designing CzeDLex - A Lexicon of Czech Discourse Connectives

We present a design for a new electronic lexicon of Czech discourse connectives. The data format and the annotation scheme are based on a study of similar existing resources, and we discuss arguments for choosing the data structure and selecting features of the lexicon entries. Special attention is paid to a consistent encoding of both primary and secondary connectives. The data itself comes ...


Anaphora in Czech: Large Data and Experiments with Automatic Anaphora Resolution

The aim of this paper is two-fold. First, we want to present a part of the annotation scheme of the Prague Dependency Treebank 2.0 related to the annotation of coreference on the tectogrammatical layer of sentence representation (more than 45,000 textual and grammatical coreference links in almost 50,000 manually annotated Czech sentences). Second, we report a new pronoun resolution system deve...


An exploitation of the Prague Dependency Treebank: a valency case

The Prague Dependency Treebank (PDT) is a manually annotated part of the Czech National Corpus (Čermák 1997). Its size is approx. 90,000 sentences, i.e. 1.5 million words (tokens). Three layers of annotation (Hajič 2002) are used: the morphological layer, where lemmas and tags are annotated, the analytical layer, which roughly corresponds to the surface (shallow) syntactic structure of the sent...



Publication date: 2017